1  Introduction


Learning  an  input-output  relation  from  examples  can  be  considered  as  the 
problem  of  approximating  an  unknown  function  f(x)  from  a  set  of  sparse 
data  points  (Poggio  and  Girosi,  1989).  From  this  point  of  view,  feedforward 
networks  are  equivalent  to  a  parametric  approximating  function  F(W,x).  As 
an  example,  consider  a  feedforward  network,  of  the  backpropagation  type, 
with  one  hidden  layer;  the  vector  W  corresponds,  then,  to  the  two  sets  of 
“weights,”  from  the  input  to  the  hidden  layer,  and  from  the  hidden  layer 
to the output. Even before considering the problem of how to find the appropriate
values of W for the set of data, the fundamental representational
problem must be approached: which class of mappings f can be approximated
by F, and how well? The neural network field has recently seen an
increasing  awareness  of  this  problem.  Several  results  have  been  published, 
all showing that backpropagation networks of different form and complexity
can approximate arbitrarily well a continuous function, provided that
an  arbitrarily  large  number  of  units  is  available  (Cybenko,  1989;  Funahashi, 
1989;  Moore  and  Poggio,  1988;  Stinchcombe  and  White,  1989;  Carrol  and 
Dickinson,  1989).  This  property  is  shared  by  algebraic  and  trigonometric 
polynomials,  as  is  shown  by  the  classical  Weierstrass  Theorem,  and  for  this 
reason we shall refer to it as the Weierstrass property. It is important, however,
to realize that results of this type should not be taken to mean that
the  approximation  scheme  is  a  “good”  approximation  scheme.  An  indication 
of  the  latter  point  is  provided,  in  the  case  of  backpropagation,  by  a  closer 
look at the published results. Taken together, they imply that almost any
nonlinearity at the hidden layer level and a variety of different architectures
(one or more hidden layers, for instance) ensure the Weierstrass property
(Funahashi,  1989;  Cybenko,  1989;  Stinchcombe  and  White,  1989).  There 
is  nothing  special  about  sigmoids,  and  in  fact  many  classical  approximation 
schemes  exist  that  can  be  represented  as  a  network  with  a  hidden  layer  and 
that  exhibit  the  Weierstrass  property.  In  a  sense  this  property  is  not  very 
useful  for  characterizing  approximation  schemes,  since  many  schemes  have  it. 
Literature  in  the  field  of  approximation  theory  reflects  this  situation,  since 
it  emphasizes  other  properties  in  characterizing  approximation  schemes.  In 
particular,  a  critical  concept  is  that  of  best  approximation.  An  approximation 
scheme  has  the  best  approximation  property  if  in  the  set  A  of  approximating 
functions (for instance the set F(W, x) spanned by parameters W) there is
one that has minimum distance from any given function of a larger set Φ (a
more  formal  definition  is  given  later).  Several  questions  can  be  asked,  such 
as  the  existence,  uniqueness,  computability,  etc.,  of  the  best  approximation. 

In this paper, we show that feedforward multilayer networks of the backpropagation
type (Rumelhart et al., 1986, 1986a; Sejnowski and Rosenberg,
1987) do not have the best approximation property for the class of continuous
functions defined on a subset of R^n. On the other hand, we prove
that  for  networks  derived  from  regularization,  and  in  particular  for  radial 
basis  function  networks,  best  approximation  exists  and  is  unique.  We  also 
prove  that  these  networks  approximate  arbitrarily  well  continuous  functions 
(see Appendices B and C). We have recently shown that radial basis function
approximation schemes can be derived from regularization and are therefore
equivalent to generalized (radial) splines (Poggio and Girosi, 1989). For
Radial Basis Function networks we prove existence and uniqueness of best
approximation.¹

The plan of the paper is as follows. We first formalize the previous arguments,
then introduce some basic notions from approximation theory. Next,
we prove that multilayer networks of the backpropagation type do not have
the best approximation property, and that networks obtained from regularization
theory have this property. In the last section, we discuss the implications
of these results and list some open questions. Appendix B shows
that the hypotheses of the Stone-Weierstrass theorem hold for Gaussian Radial
Basis Function networks (with different variances). In Appendix C we prove a
more general result: regularization networks approximate arbitrarily well any
continuous function on a compact subset of R^n.

¹ The theory has been extended by introducing the more general schemes of GRBF and
HyperBF, which can be considered as the network equivalent of generalized multidimensional
splines with free knots.

2  Most networks approximate continuous functions

In recent years there have been attempts to find a mathematical justification
for the use of feedforward multilayer networks of the backpropagation type.
Typical results deal with the possibility, given a network, of approximating
any continuous function arbitrarily well. In mathematical terms this means
that  the  set  of  functions  that  can  be  computed  by  the  network  is  dense  (see 
Appendix A) in the space of the continuous functions C[U] defined on some
subset U of R^d. The most recent results (Cybenko, 1989; Funahashi, 1989;
Stinchcombe and White, 1989) consider networks with just one layer of hidden
units, which correspond to the following class of approximating functions:


$$\Sigma = \{ f \in C[U] \mid f(x) = \sum_{i=1}^{m} c_i \sigma(x \cdot w_i + \theta_i),\ U \subset R^d,\ w_i \in R^d,\ c_i, \theta_i \in R,\ m \in N \} \qquad (1)$$

where σ is a continuous function. Depending on σ, the set Σ may or may not
be dense in the space of the continuous functions. The set D of functions σ
such that Σ is dense seems to be large. For instance, the sigmoidal functions,
that is functions such that

$$\lim_{t \to +\infty} \sigma(t) = 1, \qquad \lim_{t \to -\infty} \sigma(t) = 0,$$

belong to D (Cybenko, 1989; Funahashi, 1989). Many other types of functions
in D can be found in the paper of Cybenko (1989). The set D has been
recently extended by the result of Stinchcombe and White (1989). In fact
they prove that it contains all the functions whose mean value is different
from zero and whose Lp-norm is finite for 1 < p < ∞.
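
As a minimal sketch of the class in equation (1), the following evaluates a one-hidden-layer network on a batch of inputs. The function names `sigmoid` and `one_hidden_layer` and the array layout are assumptions made for illustration, not notation from the paper; any other continuous σ can be passed in place of the logistic function.

```python
import numpy as np

def sigmoid(t):
    # Logistic function: tends to 1 as t -> +inf and to 0 as t -> -inf.
    return 1.0 / (1.0 + np.exp(-t))

def one_hidden_layer(x, W, theta, c, sigma=sigmoid):
    """Evaluate f(x) = sum_i c_i * sigma(x . w_i + theta_i).

    x:     (n_points, d) inputs
    W:     (m, d) hidden weights, row i is w_i
    theta: (m,) hidden biases
    c:     (m,) output coefficients
    """
    hidden = sigma(x @ W.T + theta)   # shape (n_points, m)
    return hidden @ c                 # shape (n_points,)

# Hypothetical parameters: two hidden units in two dimensions.
x0 = np.zeros((1, 2))
W0 = np.array([[1.0, -2.0], [0.5, 3.0]])
theta0 = np.zeros(2)
c0 = np.array([2.0, 4.0])
value_at_origin = one_hidden_layer(x0, W0, theta0, c0)
```

At the origin with zero biases every hidden unit outputs σ(0) = 1/2, so the network value is half the sum of the coefficients.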

Other networks can be built, such that the corresponding set of approximating
functions is dense in C[U]. Consider for example the network in
figure 1. This is the most general network with one layer of hidden units,
and the class of approximating functions corresponding to it is

$$M = \{ f \in C[U] \mid f(x) = \sum_{i=1}^{m} c_i H_i(x),\ U \subset R^d,\ H_i \in C[U],\ m \in N \}. \qquad (2)$$

Figure 1: The most general network with one layer of hidden units. Here
we show the two-dimensional case, in which x = (x, y). Each function H_i
can depend on a set of unknown parameters, which are computed during the
learning phase, as well as the coefficients c_i. When H_i = σ(x · w_i + θ_i) a
network of the backpropagation type is recovered, while H_i = H(||x - t_i||)
corresponds to the RBF or GRBF scheme (Broomhead and Lowe, 1988; Poggio
and Girosi, 1989).

The functions H_i are of the form H_i = H(x; W_i), where W_i is a vector of
unknown parameters in some multidimensional space and H is a continuous
function. If the H_i are appropriately chosen the set M can be dense in
C[U]. For example the H_i could be algebraic or trigonometric polynomials,
and in this case the denseness of M would be a trivial consequence of the
Stone-Weierstrass theorem (see Appendix B). This theorem allows a significant
extension of the set of "basis" functions H_i. Appendix B gives another
example,  showing  how  Gaussian  functions  of  radial  argument  (and  different 
variances)  can  be  used  to  approximate  any  continuous  function.  Appendix 
C  provides  a  more  powerful  result  showing  that  all  networks  derived  from 
regularization  theory  can  approximate  arbitrarily  well  continuous  functions 
on a compact subset of R^n. This result includes, in particular, Radial Basis
Function networks with the radial basis function being the Green's function
of a self-adjoint differential operator associated with the Tikhonov stabilizer.
Such  Green’s  functions  include  most  of  the  known  approximation  schemes, 
such  as  the  Gaussian  and  several  types  of  splines  and  many  functions,  but 
not  all  functions,  that  satisfy  some  sufficient  conditions  given  by  Micchelli 
(1986)  in  order  to  be  interpolating  functions. 

Since a large number of networks can approximate arbitrarily well any
continuous function, it is natural to ask whether this property is really
important  from  the  point  of  view  of  approximation  theory,  and  whether  other 
more  fundamental  properties  can  be  characterized.  As  we  mentioned  already, 
one  of  the  basic  properties  that  an  approximating  set  should  have  is  the  best 
approximation  property,  that  guarantees  that  the  approximation  problem 
has a solution. The next section focuses on the relationship
between this property and different kinds of networks, since this seems to be
a more appropriate starting point for a complete analysis of the networks'
performance from a rigorous mathematical point of view.

3  Basic  facts  in  approximation  theory 

3.1  The  best  approximation  property 

An informal formulation of the approximation problem can be stated as follows:
given a function f belonging to some prescribed set of functions Φ and
given a subset A of Φ, find the element a of A that is the "closest" to f.

In  order  to  give  this  formulation  a  precise  mathematical  meaning,  some 
definitions  are  needed.  First  of  all  a  notion  of  “distance”  has  to  be  introduced 
on the set Φ. Since this set is usually assumed to be a normed linear space,
with norm indicated by || · ||, the distance d(f, g) between two elements f
and g of Φ is naturally defined as ||f - g||. Given f ∈ Φ and A ⊂ Φ we can
now define the distance of f from A as

$$d(f, A) = \inf_{a \in A} \|f - a\|. \qquad (3)$$

If the infimum of ||f - a|| is attained for some element a_0 of A, that is if
there exists an a_0 ∈ A such that ||f - a_0|| = d(f, A), this element is said
to be a best approximation to f from A. A set A is called an existence set
(uniqueness set, resp.) if, to each f ∈ Φ, there is at least (at most, resp.)
one best approximation to f from A. If the set A is an existence set we will
also say that it has the best approximation property. A set A is called a
Tchebycheff set if it is an existence set and a uniqueness set. We are now
ready to give a precise formulation of the approximation problem:

Approximation problem: given f ∈ Φ and A ⊂ Φ find a best approximation
to f from A.

From the definition above it is clear that the approximation problem has
a solution if and only if A is an existence set, and a large part of approximation
theory has been devoted to proving existence theorems, which give
sufficient conditions to guarantee existence and possibly uniqueness of closest
points. We will only present very simple properties of sets with the best approximation
property, and will apply these results to network architectures, in
order  to  understand  their  properties  from  the  point  of  view  of  approximation 
theory. 

We  begin  with  the  following  observation: 

Proposition  3.1  Every  existence  set  is  closed. 

Proof. Let A ⊂ Φ be an existence set, and suppose that it is not closed.
Then there is a sequence {a_n} of elements of A that converges to an element
f that is not in A, that is there exists an f ∈ Φ\A such that

$$\lim_{n \to \infty} d(f, a_n) = 0.$$

This means that d(f, A) = 0, and since A is an existence set there is an
element a_0 ∈ A such that ||f - a_0|| = 0. By the properties of the norm this
implies that f = a_0, which is absurd because f ∉ A and a_0 ∈ A. Then A
must be closed. □
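
A one-dimensional worked example may help fix the idea. It is not taken from the original text: the choice of space (the real line with the absolute value as norm), of the set A and of the target f are assumptions made purely for illustration.

```latex
% Illustrative example: \Phi = R with norm |.|, A = (0,1) open, target f = 2.
\[
  d(f, A) = \inf_{a \in (0,1)} |2 - a| = 1 ,
\]
% The infimum is approached by a_n = 1 - 1/n \in A, since |2 - a_n| = 1 + 1/n \to 1,
% but it is never attained: |2 - a| = 1 forces a = 1 \notin A (or a = 3 \notin A).
% Hence the non-closed set (0,1) is not an existence set.
% For the closed set [0,1] the infimum is attained at a_0 = 1:
\[
  d(f, [0,1]) = |2 - 1| = 1 .
\]
```

The missing limit point a = 1 is exactly what prevents the open interval from having the best approximation property.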

The  converse  of  this  proposition  is  not  true,  that  is  closedness  is  not 
sufficient  for  a  set  to  be  an  existence  set.  However  the  stronger  condition  of 
compactness  is  sufficient,  as  the  following  theorem  shows. 

Theorem  3.1  Let  A  be  a  compact  set  in  a  metric  space  $.  Then  A  is  an 
existence  set. 

Proof. For each f ∈ Φ the distance d(f, a), with a ∈ A, is a continuous
real valued function defined on the compact set A. By theorem A.2 of
Appendix A it attains its maximum and minimum value on this set; a point
where the minimum is attained is a best approximation to f, and this
concludes the proof. □

In the next section we apply these simple results to some network architectures.


4  Networks  and  approximation  theory 

From  the  point  of  view  of  approximation  theory  a  feedforward  network  is  a 
representation  of  a  set  A  of  parametric  functions,  and  the  learning  algorithm 
corresponds  to  the  search  of  the  best  approximation  to  some  target  function 
f from A. Since in general a best approximation does not exist unless the
set  A  has  some  properties  (see,  for  instance,  theorem  3.1),  it  is  of  interest  to 
understand  which  classes  of  networks  have  these  properties. 

4.1  Backpropagation does not have the best approximation property

Here  we  consider  the  class  of  networks  of  the  backpropagation  type  with  one 
layer of hidden units. The space Φ of functions that have to be approximated
is chosen to be C[U], the set of continuous functions defined on a subset U
of R^d, with some unspecified norm. If the number of hidden units is m, the
functions that can be computed by such networks belong to the following set
σ^m:

$$\sigma^m = \{ f \in C[U] \mid f(x) = \sum_{i=1}^{m} c_i \sigma(x \cdot w_i + \theta_i),\ w_i \in R^d,\ c_i, \theta_i \in R \} \qquad (4)$$

where σ(x) is usually a sigmoidal function. We now show that σ^m is not
an existence set, and this does not depend on the norm that has been
chosen. The result is proved in the case of σ being a sigmoid,
σ(x) = (1 + e^{-x})^{-1}, and for one hidden layer, but it holds for every other
nontrivial choice of nonlinear function and for networks with more than one
hidden layer.

Proposition 4.1 The set σ^m is not an existence set for m ≥ 2.

Proof: A necessary condition for a set to be an existence set is to be
closed. Therefore it is sufficient to show that σ^m is not closed, and this can
be done by exhibiting an accumulation point that does not belong to it. Let
us consider the following function:

$$f_\delta(x) = \frac{1}{\delta} \left( \frac{1}{1 + e^{-[w \cdot x + (\theta + \delta)]}} - \frac{1}{1 + e^{-[w \cdot x + \theta]}} \right)$$

Clearly f_δ ∈ σ^m, ∀ m ≥ 2, but it is easily seen that

$$\lim_{\delta \to 0} f_\delta(x) = g(x) = \frac{1}{2(1 + \cosh[w \cdot x + \theta])}$$

and g ∉ σ^m, ∀ m ≥ 2. For each m ≥ 2 the function g is then an accumulation
point of σ^m but does not belong to it: σ^m cannot be closed and this concludes
the proof. □
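
The limit used in the proof can be checked numerically. The sketch below assumes that f_δ is taken as the forward difference quotient of the logistic sigmoid with respect to the bias θ, so that its limit is the derivative of the sigmoid; the grid, the list of δ values and all names are illustrative choices, not part of the original argument.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f_delta(t, delta):
    # Two-unit network: difference quotient of the sigmoid in the bias,
    # written as a function of t = w . x + theta.
    return (sigmoid(t + delta) - sigmoid(t)) / delta

def g(t):
    # Claimed limit, equal to the derivative of the sigmoid.
    return 1.0 / (2.0 * (1.0 + np.cosh(t)))

t = np.linspace(-5.0, 5.0, 201)          # values of w . x + theta on a grid
deltas = [1.0, 0.1, 0.01, 0.001]
sup_errors = [float(np.max(np.abs(f_delta(t, d) - g(t)))) for d in deltas]

for d, err in zip(deltas, sup_errors):
    print(f"delta = {d:g}   sup |f_delta - g| = {err:.2e}")
```

The sup-norm error on the grid shrinks roughly in proportion to δ, which is consistent with g being an accumulation point of the two-unit networks. Note that the output coefficients ±1/δ grow without bound along the sequence.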

This result reflects a general fact in nonlinear approximation theory: usually
the set of approximating functions is not closed, and its closure must be added
to it in order to obtain an existence set. This is the case, for instance, for the
approximation by γ-polynomials in one dimension, which are replaced by the
extended γ-polynomials, to guarantee the existence of a best approximating
element  (Braess,  1986;  Rice,  1964,  1969;  Hobby  and  Rice,  1967;  De  Boor, 
1969). 


4.2  Existence  and  uniqueness  of  best  approximation 
for  regularization  and  RBF 

One  of  the  possible  approaches  to  the  problem  of  surface  reconstruction  is 
given by regularization theory (Tikhonov and Arsenin, 1977; Bertero et al.,
1988). Poggio and Girosi (1989) have shown that the solution obtained by
means  of  this  method  maps  into  a  class  of  networks  with  one  hidden  layer 
(an  instance  of  which  are  Radial  Basis  Function  networks  or  RBF).  In  fact 
the  solution  can  always  be  written  in  the  parametric  form: 

$$f(x) = \sum_{i=1}^{m} c_i \phi_i(x) \qquad (5)$$

where the c_i are unknown, m is the number of data points and the φ_i are
fixed, depending on the nature of the problem and on the data points. More
precisely the "basis function" φ_i is of the form φ_i(x) = G(x; x_i), where x_i
is a data point and G is the Green's function of some (pseudo)differential
operator P (a term belonging to the null space of P can also appear, see
Appendix C). In the particular case of radial function G = G(||x - x_i||) the
RBF method is recovered, and the solution of the approximation problem is
then a linear superposition of radial Green's functions G "centered" on the
data points.

Notice  that  this  function  can  be  computed  by  a  network  that  is  a  special 
case of the one represented in figure 1. The main difference is that in the
general case the functions G_i depend on unknown parameters, while in the
regularization context only the coefficients c_i are unknown.

Equation 5 means that the approximated solution belongs to the subset
T^m of C[U]:

$$T^m = \{ f \in C[U] \mid f(x) = \sum_{i=1}^{m} c_i \phi_i(x),\ c_i \in R \} \qquad (6)$$

Since  we  have  shown  that  the  set  of  approximating  functions  associated  with 
networks  with  one  hidden  layer  of  the  backpropagation  type  does  not  have 
the  best  approximation  property,  it  is  natural  to  ask  whether  or  not  the  set 
T^m has this property.² The answer is positive, as is stated in the following
proposition: 

Proposition 4.2 The set T^m is an existence set for m ≥ 1.

Proof. Let f be a prescribed element of C[U], and let a_0 be an arbitrary
point of T^m. We are looking for the closest point to f in T^m. It has to lie in
the set

$$\{ a \in T^m \mid \|a - f\| \le \|a_0 - f\| \}.$$

This set is clearly closed and bounded, and since T^m is finite-dimensional,
by theorem A.1 it is compact. The best approximation property comes from
theorem 3.1. □

² Notice that backpropagation cannot be derived from any regularization scheme since
it cannot be written as the linear superposition of Green's functions of any kind.

From this proposition we can see that every time that the approximating
function is a finite linear combination of basis functions, the set that is
spanned by these basis functions is an existence set for C[U]. Depending
on the norm that is chosen in C[U] the best approximating element can be
unique. In fact the following proposition holds (see Appendix A for the definition
of strictly convex):

Proposition 4.3 The set T^m, m ≥ 1, is a Tchebycheff set if the normed
space C[U] is strictly convex.

Proof. The existence has already been proved. Suppose then that there are
two best approximating elements f and f' from T^m to a function g ∈ C[U].
Let λ be the distance of g from T^m. Applying the triangle inequality we
obtain:

$$\left\| \tfrac{1}{2}(f + f') - g \right\| \le \tfrac{1}{2}\|f - g\| + \tfrac{1}{2}\|f' - g\| = \lambda \qquad (7)$$

Since T^m is a vector space, ½(f + f') ∈ T^m, and by definition of λ it
follows that ||½(f + f') - g|| ≥ λ. This implies that equality holds in equation
7. If λ = 0 it is clear that f = f' = g. If λ ≠ 0, then we can write equation
7 as

$$\left\| \frac{1}{2} \left( \frac{f - g}{\lambda} + \frac{f' - g}{\lambda} \right) \right\| = 1. \qquad (8)$$

This means that the vectors (f - g)/λ and (f' - g)/λ and their midpoint are
all of norm 1, but since strict convexity holds, then f = f'. □

Since it is well known that C[U] with the Lp-norms, 1 < p < ∞, is strictly
convex (Rice, 1964), we have then shown that in most cases regularization
theory gives an approximating set with the best approximation property and
with  a  unique  best  approximating  element. 
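
The following sketch illustrates existence and uniqueness in the simplest computable setting. It assumes a discrete L2 norm over a finite set of sample points as a stand-in for a norm on C[U], Gaussian basis functions with fixed centers, and a target function chosen only for illustration; none of these specific choices come from the paper. With the basis fixed, the best coefficients solve a linear least-squares problem.

```python
import numpy as np

def gaussian_design(x, centers, width):
    # Column i holds phi_i(x) = exp(-(x - t_i)^2 / width^2) at the sample points.
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / width**2)

x = np.linspace(0.0, 1.0, 50)        # sample points standing in for U = [0, 1]
target = np.sin(2.0 * np.pi * x)     # function to approximate (illustrative)
centers = np.linspace(0.0, 1.0, 5)   # fixed centers (hypothetical values)
Phi = gaussian_design(x, centers, width=0.25)

# Best approximation from span{phi_i} in the discrete L2 norm.
coeffs, *_ = np.linalg.lstsq(Phi, target, rcond=None)
best_error = float(np.linalg.norm(Phi @ coeffs - target))

# Any other coefficient vector gives a strictly larger error
# when the columns of Phi are linearly independent.
rng = np.random.default_rng(0)
perturbed_errors = [
    float(np.linalg.norm(Phi @ (coeffs + 0.1 * rng.standard_normal(5)) - target))
    for _ in range(100)
]
```

Because the approximating set is a finite-dimensional linear space and the Euclidean norm is strictly convex, the minimizer exists and is unique, which is what the random perturbations probe empirically.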


5  Conclusions 

5.1  GRBF  and  Best  Approximation 

We  have  recently  extended  the  scheme  of  equation  5  to  the  case  in  which 
the  number  of  basis  functions  is  less  than  the  number  of  data  points  (Poggio 
and  Girosi,  1989;  Broomhead  and  Lowe,  1988).  The  reason  for  this  is  that 
when  the  number  of  data  points  becomes  large  the  complexity  of  the  network 
may  become  too  high,  being  proportional  to  the  number  of  data  points.  A 
solution  to  the  approximation  problem  is  sought  of  the  form: 

$$f(x) = \sum_{i=1}^{n} c_i G(x; t_i) \qquad (9)$$

where n is smaller than the number of data points and the positions of the
"centers" t_i of the expansion are unknown, having to be found during the
learning stage. Does the best approximation property hold for this approximation
scheme, which we call the Generalized Radial Basis Function (GRBF)
method? The answer is no, exactly as for splines with free knots, to which
equation 9 is in fact equivalent. By the same arguments we have used in
section 4.1 we could show that the set G^n of approximating functions generated
by equation 9 (the analogue of the set T^m) is not closed. The scheme,
however, has almost the best approximation property in the following sense.
The scheme already works satisfactorily if the centers t_i are fixed to a subset
of examples or other positions. In this case G^n is a linear space, and it is
an existence set, as is T^m. We could then have an algorithm in which
first the centers are found independently (for instance by the K-means algorithm,
see Moody and Darken, 1989) and then the c_i are obtained with
gradient descent methods (see Poggio and Girosi, 1989). In this scheme the
best approximation property is preserved, while the computational complexity
has been reduced with respect to the exact solution of the regularization
problem.
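
The two-stage scheme can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the centers come from a plain Lloyd-style K-means written inline, the basis is a Gaussian of fixed width, and the coefficients are obtained by a direct linear least-squares solve rather than the gradient descent mentioned in the text (both minimize the same quadratic error once the centers are frozen). The data and all names are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns k centers. points has shape (N, d)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:      # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def fit_grbf(points, values, k, width):
    centers = kmeans(points, k)                            # stage 1: freeze the t_i
    sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-sq / width**2)                             # G[j, i] = G(x_j; t_i)
    coeffs, *_ = np.linalg.lstsq(G, values, rcond=None)    # stage 2: linear in c_i
    return centers, coeffs, G

rng = np.random.default_rng(1)
points = rng.uniform(-1.0, 1.0, size=(200, 2))             # synthetic data points
values = np.sin(np.pi * points[:, 0]) * points[:, 1]       # synthetic target values
centers, coeffs, G = fit_grbf(points, values, k=10, width=0.5)
rms_error = float(np.sqrt(np.mean((G @ coeffs - values) ** 2)))
```

Once the centers are frozen the model is linear in the c_i, so the second stage is a problem of the same kind as the one for T^m, with 10 basis functions instead of 200.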

There are other ways to make GRBF a best approximation scheme. The most
interesting approach is to follow the theory of γ-polynomials (Braess, 1986;
Rice, 1964, 1969; Hobby and Rice, 1967; De Boor, 1969) and complete the
set of basis functions with its closure, consisting of an appropriate number
of  derivatives  of  the  Green’s  function  with  respect  to  its  parameters,  yielding 
a  best  approximation  scheme.  It  seems  very  difficult  to  use  either  of  these 
two  approaches  for  backpropagation  networks. 

5.2  Open  Questions 

We have not explored the practical consequences of the fact that backpropagation
is not a best approximation scheme. Intuitively, it seems that the lack of the
best approximation property is related to possible practical degeneracies of
the solution. In certain situations, because of the fact that the sigmoid,
which is asymptotically constant, contains as an argument one set of parameters
(the w_i), the precise values of these parameters may not have any
significant effect on the output of the network. The same situation happens
for GRBF when the centers inside the Green's function are unknown. In the
GRBF case, however, we can freeze the t_i to reasonable values whereas this
is  impossible  in  the  backpropagation  case. 

Other  questions  remain  open  as  well.  The  most  important  questions 
from  the  viewpoint  of  approximation  theory  are:  (1)  the  computation  of  the 
best  approximation,  i.e.,  which  algorithm  to  use,  (2)  a  priori  bounds  on  the 
goodness  of  the  approximation  given  some  generic  information  on  the  class 
of  functions  to  be  approximated,  and  (3)  a  priori  estimates  of  the  complexity 
of  the  best  approximation,  again  given  generic  information  on  the  class  of 
functions  to  be  approximated.  In  the  case  of  RBF,  the  latter  question  is 
directly  related  to  the  size  of  the  required  training  set,  and  therefore  to  the 
deep  issue  of  sample  complexity  (see  Poggio  and  Girosi,  1989,  section  9.3). 
About problems (1) and (2) notice that in practical cases it may be admissible
to  use  a  scheme  which  is  not  best  approximation,  if  it  provides  an  almost  as 
good  approximation  at  a  much  lower  computational  cost. 

Acknowledgments  We  are  grateful  to  G.  Palm  and  E.  Grimson  for 
useful  suggestions. 

A  Definitions and basic theorems


We  review  here  some  of  the  definitions  that  have  been  used  in  the  paper. 
Every set will be assumed to have the structure of metric space, unless otherwise
specified, and the concepts of limit point, infimum and supremum
are assumed to be known. All these definitions and theorems can be found
in any standard text on functional analysis (Yosida, 1974; Rudin, 1973) and
in  many  books  on  approximation  theory  (Braess,  1986;  Cheney,  1981). 

An  important  concept  is  that  of  closure : 

Definition A.1 If S is a set of elements, then by the closure [S] of S we
mean the set of all points in S together with the set of all limit points of S.

We can now define the closed sets as follows:

Definition A.2 A set S is closed if it is coincident with its closure [S].

A closed set then contains all its limit points. Another important definition
related to the concept of closure is that of dense sets:

Definition A.3 Let T be a subset of the set S. T is dense in S if [T] = S.

If  T  is  dense  in  S  then  each  element  of  S  can  be  approximated  arbitrarily  well 
by  elements  of  T.  As  an  example  we  mention  the  set  of  rational  numbers, 
that  is  dense  in  the  set  of  real  numbers,  and  the  set  of  polynomials  that  is 
dense  in  the  space  of  continuous  functions  (see  appendix  B). 

In  order  to  extend  some  properties  of  the  real  valued  functions  defined  on 
an  interval  to  real  valued  functions  defined  on  more  complex  metric  spaces 
it  is  fundamental  to  define  the  compact  sets: 

Definition  A.4  A  compact  set  is  one  in  which  every  infinite  subset  contains 
at  least  one  limit  point. 

It can be shown that, in finite dimensional metric spaces, there exists a simple
characterization of compact sets. In fact the following theorem holds:

Theorem A.1 Every closed, bounded, finite-dimensional set in a metric linear
space is compact.

The well known Weierstrass theorem on the attainment of the extrema of a
continuous function on an interval can now be extended as follows:

Theorem A.2 A continuous real valued function defined on a compact set
in a metric space achieves its infimum and supremum on that set.

A subset of the metric spaces is given by the normed spaces, and among the
normed spaces, a special role is played by the strictly convex spaces:

Definition A.5 A normed space is strictly convex if:

$$\|f\| = \|g\| = \left\| \tfrac{1}{2}(f + g) \right\| = 1 \ \Rightarrow\ f = g.$$

The  geometrical  interpretation  of  this  definition  is  that  a  space  is  strictly 
convex  if  the  unit  sphere  does  not  contain  any  line  segment  on  its  surface. 
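
A small numerical illustration of the definition, using the plane with two different norms. The example vectors are assumptions chosen for illustration: under the sup norm two distinct unit vectors can have a midpoint of norm 1, so that norm is not strictly convex, while under the Euclidean norm the midpoint of distinct unit vectors falls strictly inside the unit ball.

```python
import numpy as np

# Sup norm: f and g are distinct, yet f, g and their midpoint all have norm 1,
# so the unit sphere contains the segment joining f and g.
f = np.array([1.0, 0.0])
g = np.array([1.0, 1.0])
mid = 0.5 * (f + g)
sup_norms = [float(np.linalg.norm(v, ord=np.inf)) for v in (f, g, mid)]

# Euclidean norm: two distinct unit vectors have a midpoint of norm below 1.
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
euclid_mid = float(np.linalg.norm(0.5 * (u + v)))
```

This is the finite-dimensional analogue of the statement that the Lp-norms with 1 < p < ∞ are strictly convex while the uniform norm is not.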

B  Gaussian  networks  and  Stone’s  theorem 

It has been proved (Cybenko, 1989; Funahashi, 1989) that a network with
one hidden layer of sigmoidal units can approximate a continuous function
arbitrarily well. Here we show that this property, which is well known for
algebraic and trigonometric polynomial approximation schemes, is shared by
a network with Gaussian hidden units. The proof is a simple application of
the Stone-Weierstrass theorem, which is the generalization given by Stone of
the Weierstrass approximation theorem (Stone, 1937, 1948). Our result was
obtained independently from the equivalent proof of Hartman, Keeler and
Kowalski (1989). We first need the definition of an algebra.

Definition B.1 An algebra is a set of elements denoted by Y, together with
a scalar field F, which is closed under the binary operations of + (addition
between elements of Y), × (multiplication of elements of Y), · (multiplication
of elements in Y by elements from the scalar field F), such that

1. Y together with F, + and · forms a linear space,

2. if f, g, h are in Y, α is in F, then

a. f × g is in Y,

b. f × (g × h) = (f × g) × h,

c. f × (g + h) = f × g + f × h,

d. (f + g) × h = f × h + g × h,

e. α · (f × g) = (α · f) × g = f × (α · g).

It is an elementary calculation to show that if U is some subset of R^d then
C[U] is an algebra with respect to the scalar field R. We can now define a
subalgebra as follows:

Definition B.2 A set S is a subalgebra of the algebra Y if

1. S is a linear subspace of Y,

2. S is closed under the operation ×. That is, if f and g are in S
then f × g is also in S.

We can now state Stone's theorem:

Theorem B.1 (Stone, 1937) Let $X$ be a compact metric space, $C[X]$ the
set of continuous functions defined on $X$, and $A$ a subalgebra of $C[X]$ with
the following two properties:

1. the function $f(x) = 1$ belongs to $A$;

2. for any two distinct points $x$ and $y$ in $X$ there is a function $f \in A$ such
that $f(x) \neq f(y)$.

Then $A$ is dense in $C[X]$.

As a simple application of this theorem we consider the set of Gaussian
superpositions, defined as

$$\mathcal{G}_X = \Big\{ f \in C[X] \;\Big|\; f(x) = \sum_{i=1}^{m} c_i \, e^{-\frac{\|x - t_i\|^2}{\sigma_i^2}}, \; X \subset R^d, \; t_i \in R^d, \; c_i, \sigma_i \in R, \; m \in N \Big\} \tag{10}$$

We can now state the following:

Proposition B.1 The set $\mathcal{G}_X$ is dense in $C[X]$, where $X$ is a compact subset
of $R^d$.

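Proof: In order to use Stone's theorem, we first have to show that $\mathcal{G}_X$
is a subalgebra of $C[X]$, for each compact subset $X$ of $R^d$. Since $\mathcal{G}_X$ is
closed under linear combinations by construction, it will be a subalgebra of
$C[X]$ if the product of two of its elements yields another element of $\mathcal{G}_X$.
Since the elements of $\mathcal{G}_X$ are linear superpositions of Gaussians of
different variances centered on different points, it is sufficient to deal with
the product of two Gaussians. From the identity below it follows that the
product of two Gaussians centered on two points $t_1$ and $t_2$ is proportional to
a Gaussian centered on a point $t_3$ that is a convex linear combination of $t_1$
and $t_2$. In fact we have:

$$e^{-\frac{\|x - t_1\|^2}{\sigma_1^2}} \, e^{-\frac{\|x - t_2\|^2}{\sigma_2^2}} = c \, e^{-\frac{\|x - t_3\|^2}{\sigma_3^2}}$$

where

$$t_3 = \frac{\sigma_2^2 \, t_1 + \sigma_1^2 \, t_2}{\sigma_1^2 + \sigma_2^2}, \qquad \sigma_3^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}, \qquad c = e^{-\frac{\|t_1 - t_2\|^2}{\sigma_1^2 + \sigma_2^2}}.$$

The function $f(x) = 1$ belongs to $\mathcal{G}_X$, since it can be considered as a Gaussian
of infinite variance, and for any distinct points $x$, $y$ we can obviously find a
function $f$ in $\mathcal{G}_X$ such that $f(x) \neq f(y)$ (for instance, a Gaussian centered
on $x$): the conditions of Stone's theorem are then satisfied and $\mathcal{G}_X$ is dense
in $C[X]$ □.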
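
The product identity used in the proof can be checked numerically. The following is a minimal sketch, assuming the Gaussians are written as $e^{-\|x - t\|^2/\sigma^2}$ (with a different normalization of the exponent the constant $c$ changes accordingly). The function names `gaussian` and `product_parameters` and all numerical values are illustrative choices, not part of the original text.

```python
import numpy as np


def gaussian(x, t, sigma):
    """Unnormalized Gaussian exp(-||x - t||^2 / sigma^2), one value per row of x."""
    return np.exp(-np.sum((x - t) ** 2, axis=-1) / sigma ** 2)


def product_parameters(t1, sigma1, t2, sigma2):
    """Return (c, t3, sigma3) such that g1(x) * g2(x) = c * g3(x) for all x."""
    total = sigma1 ** 2 + sigma2 ** 2
    # t3 is a convex combination of t1 and t2: the two weights sum to one.
    t3 = (sigma2 ** 2 * t1 + sigma1 ** 2 * t2) / total
    sigma3 = np.sqrt(sigma1 ** 2 * sigma2 ** 2 / total)
    c = np.exp(-np.sum((t1 - t2) ** 2) / total)
    return c, t3, sigma3


# Arbitrary centers, widths and evaluation points (illustrative values only).
rng = np.random.default_rng(0)
d = 3
t1, t2 = rng.normal(size=d), rng.normal(size=d)
sigma1, sigma2 = 0.7, 1.3
x = rng.normal(size=(1000, d))

c, t3, sigma3 = product_parameters(t1, sigma1, t2, sigma2)
lhs = gaussian(x, t1, sigma1) * gaussian(x, t2, sigma2)
rhs = c * gaussian(x, t3, sigma3)
print("max abs difference:", np.max(np.abs(lhs - rhs)))
```

The two sides should agree up to floating-point rounding, which is the closure under multiplication that the subalgebra argument relies on.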


C  Regularization  networks  can  approximate 
smooth  functions  arbitrarily  well 

In this appendix we briefly describe the regularization method for
approximating functions and show that the networks that are derived from a
regularization principle can approximate continuous functions defined on a
compact subset of $R^n$ arbitrarily well.

Let $S = \{(x_i, y_i) \in R^n \times R \mid i = 1, \ldots, N\}$ be a set of data that we want to
approximate by means of a function $f$. The regularization approach (Tikhonov,
1963; Tikhonov and Arsenin, 1977; Morozov, 1984; Bertero, 1986) consists
in computing the function $f$ that minimizes the functional

$$H[f] = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \|Pf\|^2$$

where $P$ is a constraint operator (usually a differential operator), $\|\cdot\|$ is
a norm on the function space to which $Pf$ belongs (usually the $L^2$ norm)
and $\lambda$ is a positive real number, the so-called regularization parameter. The
structure of the operator $P$ embodies the a priori knowledge about the
solution, and therefore depends on the nature of the particular problem that has
to be solved. The general form of the solution of this variational problem is
given by the following expansion (Poggio and Girosi, 1989):

$$f(x) = \sum_{i=1}^{N} c_i G(x; x_i) + p(x) \tag{11}$$

where $G$ is the Green's function of the differential operator $\hat{P}P$, $\hat{P}$ being the
adjoint operator of $P$, $p(x)$ is a linear combination of functions that span the
null space of $P$, and the coefficients $c_i$ can be found by inverting a matrix that
depends on the data points (Poggio and Girosi, 1989). We remind the reader
that the Green's function of an operator $\hat{P}P$ is the function that satisfies the
following differential equation (in the sense of distributions):

$$\hat{P}P \, G(x; y) = \delta(x - y). \tag{12}$$

It is clear that there is a correspondence between the class of functions that
can be written in the form (11) (for any number of data points and for any
Green's function $G$ of a self-adjoint operator) and a subclass of feedforward
networks with one layer of hidden units, of the type shown in figure 1.
Under mild assumptions on $\hat{P}P$, these networks can approximate continuous
functions arbitrarily well, as is stated in the following proposition:
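
As an illustration of expansion (11), the following is a minimal sketch of a regularization network in one dimension. It assumes a Gaussian Green's function (so that the null-space term $p(x)$ is absent) and assumes that the coefficients solve the linear system $(G + \lambda I)c = y$, with $G_{ij} = G(x_i; x_j)$, which is one concrete form of the matrix inversion mentioned above. The function names, the width `sigma`, the value of `lam` and the sine target are all hypothetical choices made for the example.

```python
import numpy as np


def green_gaussian(x, centers, sigma):
    """Assumed Green's function G(x; x_i) = exp(-||x - x_i||^2 / sigma^2).

    x has shape (M, n), centers has shape (N, n); the result has shape (M, N).
    """
    sq_dist = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / sigma ** 2)


def fit_regularization_network(x_data, y_data, lam, sigma):
    """Solve (G + lam * I) c = y for the coefficients c_i of expansion (11)."""
    G = green_gaussian(x_data, x_data, sigma)
    return np.linalg.solve(G + lam * np.eye(len(x_data)), y_data)


def evaluate_network(x, x_data, coeffs, sigma):
    """Evaluate f(x) = sum_i c_i G(x; x_i), with one hidden unit per data point."""
    return green_gaussian(x, x_data, sigma) @ coeffs


# Illustrative data: N = 20 samples of a smooth function on [0, 1].
sigma, lam = 0.2, 1e-3
x_data = np.linspace(0.0, 1.0, 20)[:, None]
y_data = np.sin(2.0 * np.pi * x_data[:, 0])
coeffs = fit_regularization_network(x_data, y_data, lam, sigma)

x_test = np.linspace(0.0, 1.0, 200)[:, None]
f_test = evaluate_network(x_test, x_data, coeffs, sigma)
max_err = np.max(np.abs(f_test - np.sin(2.0 * np.pi * x_test[:, 0])))
print("max error on test grid:", max_err)
```

Each data point contributes one hidden unit centered on it, which is the correspondence with one-hidden-layer networks described above. Larger values of `lam` trade fidelity to the data for smoothness.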

Proposition C.1 For every continuous function $F$ defined on a compact
subset of $R^n$, every piecewise continuous $G$ which is the Green's function
of a self-adjoint differential operator, and any positive $\epsilon$, there exists a
function $f^*(x) = \sum_{i=1}^{N} c_i G(x; x_i)$ such that for all $x$ the following inequality
holds:

$$|F(x) - f^*(x)| < \epsilon$$

Proof: Let $F$ be a continuous function defined on a compact set $D \subset R^n$.
Its domain of definition can be extended to all of $R^n$ by assigning zero value to
all points that do not belong to $D$. The resulting function, that we still call
$F$, is a continuous function with bounded support$^3$. Consider the space $K$ of
"test functions" (Gelfand and Shilov, 1964), that consists of real functions
$\phi(x)$ with continuous derivatives of all orders and with bounded support
(which means that the function and all its derivatives vanish outside of some
bounded region). As Gelfand and Shilov show (Appendix 1.1), there always
exists a function $\phi(x)$ in $K$ arbitrarily close to $F$, i.e., for any $\epsilon > 0$ there
is a $\phi \in K$ such that for all $x$

$$|F(x) - \phi(x)| < \epsilon.$$

Thus it is sufficient to show that every function $\phi(x) \in K$ can be
approximated arbitrarily well by a linear superposition of Green's functions
(function $f^*$ of proposition C.1).

We start with the identity

$$\phi(x) = \int dy \, \phi(y) \, \delta(x - y) \tag{13}$$

where the integral is actually taken only over the bounded region in which
$\phi(x)$ fails to vanish. By means of equation 12 we obtain

$$\phi(x) = \int dy \, \phi(y) \, (\hat{P}P G)(x; y) \tag{14}$$

and since $\phi(x)$ is in $K$ and $\hat{P}P$ is formally self-adjoint we have

$$\phi(x) = \int dy \, G(x; y) \, (\hat{P}P \phi)(y). \tag{15}$$

We can rewrite equation 15 as

$$\phi(x) = \int dy \, G(x; y) \, \psi(y) \tag{16}$$

where $\psi(x) = \hat{P}P \phi(x)$. Since $G(x; y)\psi(y)$ is piecewise continuous on a
closed domain, this integral exists in the sense of Riemann. By definition of
the Riemann integral, equation 16 can then be written as

$$\phi(x) = \Delta^n \sum_{k \in I} \psi(x_k) G(x; x_k) + E_x(\Delta) \tag{17}$$

where $x_k$ are points of a square grid of spacing $\Delta$, $I$ is the finite set of
lattice points where $\psi(x) \neq 0$, and $E_x(\Delta)$ is the discretization error, with
the property

$$\lim_{\Delta \to 0} E_x(\Delta) = 0. \tag{18}$$

If we now choose $f^*(x) = \Delta^n \sum_{k \in I} \psi(x_k) G(x; x_k)$, combining equation 18
and equation 17 we obtain

$$\lim_{\Delta \to 0} [\phi(x) - f^*(x)] = 0. \tag{19}$$

Thus every function $\phi \in K$ can be approximated arbitrarily well by a
linear superposition of Green's functions $G$ of a self-adjoint operator, and this
concludes the proof □.

$^3$ The support of a continuous function $F(x)$ is the closure of the set on which $F(x) \neq 0$.
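
The discretization step of the proof (equations 16 to 19) can be illustrated numerically in one dimension. The sketch below is an assumption-laden example rather than part of the proof: it takes the self-adjoint operator $d^2/dx^2$, whose Green's function is $G(x; y) = |x - y|/2$, and the standard test function $\phi(y) = e^{-1/(1 - y^2)}$ on $|y| < 1$ (zero elsewhere), for which $\psi = \phi''$ has the closed form $\phi(y)(6y^4 - 2)/(1 - y^2)^4$. The function names and grid spacings are hypothetical.

```python
import numpy as np


def phi(y):
    """Test function in K: exp(-1 / (1 - y^2)) for |y| < 1, zero elsewhere."""
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    inside = np.abs(y) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - y[inside] ** 2))
    return out


def psi(y):
    """Second derivative of phi: phi(y) * (6 y^4 - 2) / (1 - y^2)^4 for |y| < 1."""
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    inside = np.abs(y) < 1.0
    u = 1.0 - y[inside] ** 2
    out[inside] = np.exp(-1.0 / u) * (6.0 * y[inside] ** 4 - 2.0) / u ** 4
    return out


def green(x, y):
    """G(x; y) = |x - y| / 2, a Green's function of d^2/dx^2 in one dimension."""
    return 0.5 * np.abs(x - y)


def f_star(x, delta):
    """Riemann sum delta * sum_k psi(x_k) G(x; x_k) on the grid x_k = k * delta."""
    x = np.asarray(x, dtype=float)
    k_max = int(np.ceil(1.0 / delta))
    grid = delta * np.arange(-k_max, k_max + 1)
    weights = psi(grid)  # zero outside (-1, 1), so only the set I contributes
    return delta * np.sum(weights[None, :] * green(x[:, None], grid[None, :]), axis=1)


x_eval = np.linspace(-1.5, 1.5, 61)
for delta in (0.1, 0.05, 0.01):
    err = np.max(np.abs(phi(x_eval) - f_star(x_eval, delta)))
    print(f"delta = {delta:5.2f}   max |phi - f*| = {err:.2e}")
```

The coefficients of the superposition are $c_k = \Delta \, \psi(x_k)$, so refining the grid adds hidden units while the discretization error shrinks.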

Remark The conditions of proposition C.1 exclude Green's functions
that have singularities at the origin. An example is the Green's function
associated with the "membrane" stabilizer $P = \nabla$ in 2 or more dimensions.
In 2 dimensions, the membrane Green's function is $G(r) = -\log r$, where
$r = \|x - x_i\|$ (in 1 dimension $G(x) = |x|$, which satisfies the conditions of
proposition C.1).

Remark Notice that in order to approximate any continuous function on a
compact domain arbitrarily well with functions of the form (11), it is not
necessary to include the term $p$ belonging to the null space of $P$.